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Abstract 

Background: The automation of objectively selecting amino acid residue ranges for structure superpositions is 
important for meaningful and consistent protein structure analyses. So far there is no widely-used standard for 
choosing these residue ranges for experimentally determined protein structures, where the manual selection of 
residue ranges or the use of suboptimal criteria remain commonplace. 

Results: We present an automated and objective method for finding amino acid residue ranges for the 
superposition and analysis of protein structures, in particular for structure bundles resulting from NMR structure 
calculations. The method is implemented in an algorithm, CYRANGE, that yields, without protein-specific parameter 
adjustment, appropriate residue ranges in most commonly occurring situations, including low-precision structure 
bundles, multi-domain proteins, symmetric multimers, and protein complexes. Residue ranges are chosen to 
comprise as many residues of a protein domain that increasing their number would lead to a steep rise in the 
RMSD value. Residue ranges are determined by first clustering residues into domains based on the distance 
variance matrix, and then refining for each domain the initial choice of residues by excluding residues one by one 
until the relative decrease of the RMSD value becomes insignificant. A penalty for the opening of gaps favours 
contiguous residue ranges in order to obtain a result that is as simple as possible, but not simpler. Results are 
given for a set of 37 proteins and compared with those of commonly used protein structure validation packages. 
We also provide residue ranges for 6351 NMR structures in the Protein Data Bank. 

Conclusions: The CYRANGE method is capable of automatically determining residue ranges for the superposition 
of protein structure bundles for a large variety of protein structures. The method correctly identifies ordered 
regions. Global structure superpositions based on the CYRANGE residue ranges allow a clear presentation of the 
structure, and unnecessary small gaps within the selected ranges are absent. In the majority of cases, the residue 
ranges from CYRANGE contain fewer gaps and cover considerably larger parts of the sequence than those from 
other methods without significantly increasing the RMSD values. CYRANGE thus provides an objective and 
automatic method for standardizing the choice of residue ranges for the superposition of protein structures. 



Background 

Most proteins comprise structured and unstructured 
regions. It is important to identify these regions to 
meaningfully compare or analyze protein structures. The 
most commonly used similarity measure for three- 
dimensional structures are atomic root-mean-square 
deviation (RMSD) values, which are computed for all or 
a subset of atoms in two or more structures after their 
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optimal superposition, which is the one that minimizes 
the RMSD value [1]. For instance, NMR protein struc- 
tures are generally represented by a bundle of confor- 
mers that have been calculated starting from different 
randomized initial conformations using identical input 
data, and it has been proposed to represent also crystal- 
lographic structures by an ensemble of conformations 
[2]. The precision of an NMR protein structure is 
measured by the average RMSD value between the indi- 
vidual conformers and their average coordinates, com- 
puted after superimposing the conformers onto the first 
one. Both the superposition and the resulting RMSD 
values are strongly influenced by the choice of atoms 



O© 201 1 Kirchner and Guntert; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
BiolVlGCl C^ntrBl Commons Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 



Kirchner and Guntert BMC Bioinformatics 201 1, 12:1 70 
http://www.biomedcentral.eom/1 471 -2 1 05/1 2/1 70 



Page 2 of 1 1 



that are included in the fit. Including unstructured parts 
of a structure yields large RMSD values that may 
obscure the presence of well-defined structured regions 
of the protein. It is therefore crucial to identify the 
structured regions. A (subjective) choice can be made by 
visually inspecting the structures, but for reasons of con- 
sistency, reproducibility and efficiency an automated 
method is highly desirable. In this paper we present a 
new method for this purpose that has several advantages 
over existing approaches. 

Various methods have been proposed to identify 
well-defined regions of a protein on the basis of the 
atomic coordinates of an ensemble of structures, such 
as the bundle of conformers resulting from an NMR 
structure determination [3-10]. Ordered regions can 
be identified by analyzing the local structure, for 
example by applying a cutoff on angular order para- 
meters for the backbone torsion angles cp and \\i [11]. 
Ordered regions found by these strictly local 
approaches cannot necessarily be superimposed simul- 
taneously with a low RMSD value. Other methods aim 
at finding the part(s) of a protein structure that are 
sufficiently similar to each other within the ensemble. 
Some algorithms rely on the distance variance matrix 
with elements D t j = a(djj) 2 , where a(djj) is the stan- 
dard deviation of the distance between the atoms i 
and computed over all structures in the comparison 
[3-6]. Small matrix elements indicate that the position 
of the corresponding groups of atoms is similar in all 
members of the structure bundle. Another approach, 
implemented in the molecular graphics program 
MOLMOL [12], superimposes the structures with the 
current set of atoms (starting with all atoms or a user- 
defined subset), and then discards in each step the 
atoms from the residue with the largest global displa- 
cement, until either the RMSD falls below a maximally 
acceptable value or a minimal number of residues is 
reached. In the field of protein structure prediction 
[13], two algorithms, LCS ("longest continuous seg- 
ments") and GDT ("global distance test"), have been 
established for detecting regions of local and global 
structure similarities [14]. The LCS procedure finds 
the longest continuous segment of residues that can 
fit under a given RMSD cutoff. The GDT algorithm 
searches for the largest (not necessarily continuous) 
set of residues that deviate by no more than a speci- 
fied distance cutoff [14]. Results are reported as 
the percentage of residues under a given distance cut- 
off. A popular measure is the "GDT total score", 
GDT_TS = (P 1 + P 2 + P 4 + P 8 )/4, where P d is the frac- 
tion of residues that can be superimposed under a dis- 
tance cutoff of d A, which reduces the dependence on 
the choice of the cutoff by averaging over four differ- 
ent distance cutoff values. 



Results and Discussion 
The CYRANGE algorithm 

The CYRANGE algorithm (Figure 1) yields residue 
ranges for the superposition of protein structures with 
the same sequence. The algorithm has been designed (i) 
to find residue ranges that are suitable for the global 
superposition of protein domains (rather than detecting 
local order), (ii) to provide simple residue ranges with 
no or only a small number of gaps, (iii) to include as 
many residues as reasonably possible, (iv) to be applic- 
able without change to structure bundles of high and 
low precision, (v) to be applicable to multi-domain pro- 
teins (without input specification of the domain bound- 
aries), (vi) to handle symmetric multimers and protein 
complexes, and (vii) to work with a single set of para- 
meters for all proteins. CYRANGE requires as input the 
Cartesian coordinates of two or more structures and 
consists of two main steps, domain identification and 
residue range determination for each domain. 



Calculate angular order parameters 



Determine core atoms 



Generate distance variance matrix 



Cluster core atoms 



Identify best clustering stage 



Q 



Find residues with largest displacement 



Compute gap-weighted RMSD decrease 



Remove one residue 



Fill small gaps 



Report optimal set of residues 



Figure 1 Flowchart. Flowchart of the CYRANGE algorithm for 
finding residue ranges for the global superposition of protein 
structures. 
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Domain identification 

Domain identification follows the approach used in the 
NMRCORE [6] and NMRCLUST [5] algorithms. Similar 
ideas have been used earlier [3,4,7,15]. Dihedral angle 
order parameter values of torsion angles from all con- 
formers are computed. A cutoff value is calculated from 
the order parameter list and applied to select the torsion 
angles that will be used to identify the "core atoms". 
The core atoms are located in locally well-defined 
regions of the structure bundle, and are therefore poten- 
tially involved in domains that can be superimposed 
with low RMSD. The variances of the intra-conformer 
distances between all core atoms are used to cluster the 
core atoms, which eventually yields the single or multi- 
ple domains present in the structure bundle. 
Angular order parameter calculation Dihedral angle 
order parameters [11] are calculated from all rotatable 
dihedral angles (except the peptide bond dihedral angle 
co). For a given torsion angle, the angular order parameter 
S with 0 < S < 1 is given by 



N 

k=i 

where the sum runs over the values 0^ of a torsion 
angle in all N structures in the comparison. The higher 
the local order the higher the value of S. 

If angular order parameters are computed from only 
the dihedral angles 4>, \|/, and x 1 the final results are lar- 
gely identical to those obtained with the approach 
described above (see Results). Using all rotatable dihe- 
dral angles was given precedence as this approach yields 
a larger number of core atoms, and thus a larger base 
for clustering, while still excluding atoms from severely 
disordered parts of the protein. 

Core atom determination The set of core atoms con- 
sists of the C a atoms of those residues that contain at 
least one well-ordered torsion angle with an angular 
order parameter 5 >S cut . We did not want to impose a 
fixed cutoff value S cut because the degree of order 
within structure bundles is different in each case. 
Instead, the cutoff value S cut is chosen as the angular 
order parameter value 5, of the torsion angle i that max- 
imizes the quantity 

c cmin 

Q, = (s-i) ^ 6 . - n . 

Here, s denotes the total number of torsion angles for 
which angular order parameters are calculated, 5 max and 
S mln are the maximal and minimal angular order para- 
meter values, and r,- e {l,...,s} is the rank of the torsion 
angle i in an ordered list of the angular order parameter 
values (e.g. the torsion angle with the smallest S t has 



rank r, = 1, the torsion angle with the largest Si has 
rank r t = 5). We denote the number of core atoms by C. 
Plots of Qt as a function of the order parameter rank i 
for various proteins (Additional File 1) show in all cases 
a clear maximum and the absence of distant, compar- 
ably high local maxima. 

We found that using only C a s is sufficient for reliable 
domain identification. Additional core atoms merely slo- 
wed down the calculations. We also attempted to simply 
use all C a atoms in the following clustering. This simple 
approach, however, was less reliable in domain identifi- 
cation than the present one based on angular order 
parameters. 

Distance variance matrix The variance Vy of the intra- 
conformer distance between any two core atoms i and /' 
is calculated over all N members of the structure bun- 
dle, 

1 N - 2 - 1 N 

V i} ■ = — ^2 [dyk - dtj) with dij = — ^ d^, 

fc=i fc=i 

where denotes the distance between atoms i and /' 
in conformer k. 

Core atom clustering To determine the residues that 
belong to the same domain the core atoms are clustered 
using an agglomerative hierarchical clustering algorithm. 
At the outset (clustering stage 1) each core atom forms 
a cluster of its own. At each subsequent stage of cluster- 
ing two clusters are merged, until at the end there 
remains a single cluster containing all core atoms. 
Hence there are as many clustering stages as there are 
core atoms. In each stage of clustering the two nearest 
neighbour clusters are identified and merged. All other 
clusters remain unchanged and are simply propagated 
to the next stage. The nearest neighbour clusters are 
defined by Ward's method as the two clusters that yield, 
after merging, the lowest intra-cluster V-value variance 
of all possible two-cluster combinations. The intra-clus- 
ter l/-value variance is computed as the variance of the 
Vy values for all atoms i and j in the merged cluster, or, 
if the merged cluster contains only two atoms, by the 
corresponding single Vk value. 

Identification of the best clustering stage At each of 
the clustering stages i = 2, C the average cluster 
spread 

Ai= — J^RMSDj 

is calculated, where the sum runs over all clusters with 
more than one member, c, is the total number of core 
atoms from all clusters with more than one member, 
and RMSD y is the average over all structures of the 
RMSD to the mean coordinates for the backbone atoms 



Kirchner and Guntert BMC Bioinformatics 201 1, 12:170 
http://www.biomedcentral.eom/1 471 -2 1 05/1 2/1 70 



Page 4 of 1 1 



N, C™, and C of the residues given by the core atoms in 
cluster All RMSD values are calculated using singular 
value decomposition [16]. Low cluster spreads indicate 
high intra-cluster homogeneity, i.e. atom pairs with a 
similar degree of inter-atom distance variation are likely 
to belong to the same structural unit within the protein. 
A low number of clusters points to no artificial division 
of domains having occurred. 

To determine the optimal clustering stage, the 
quantity 

, Ai-A min 

Pi = (C- 2) ■ r- + Hi 

is calculated for the clustering stages i = 2, C. A mm 
and ^4 max are the minimum and maximum average clus- 
ter spread values, respectively, and «, is the number of 
clusters at stage i, including single-element clusters. 

The clustering stage with the lowest P value is chosen 
as the optimal clustering stage, i*, provided that the 
average number of core atoms in a cluster at stage i* 
exceeds one eighth of the total number of core atoms, 
rounded up to the nearest integer. In the calculation of 
this average cluster size only those clusters are consid- 
ered which contain at least the minimal number of ele- 
ments see below) required for a cluster to be 
considered as a domain. If the average cluster size is too 
small, a new minimum value of P is determined in the 
restricted range i = i* + 1, C. The procedure is 
repeated until the average number of cluster elements at 
the clustering stage with minimum P exceeds one eighth 
of the total number of core atoms. The minimum of Pi 
as a function of the clustering stage i is sharply defined, 
usually at or near the last clustering stage (Additional 
File 2). 

Each cluster at the optimum stage of clustering is con- 
sidered as a domain, provided that it contains at least ^ 
elements. By default, fi = 8. 

The residue ranges corresponding to the identified 
domains are passed to the residue range determination 
procedure. First, however, these ranges are extended at 
all boundaries by m residues, so as not to restrict the 
range determination procedure to perhaps too narrow a 
starting range. By default, m = 3. 
Residue range determination 

The determination of the residue range for each domain 
starts with the residues of a previously identified and 
subsequently extended domain. The algorithm proceeds 
by iteratively removing residues until the set of residues 
does not change in an iteration. The algorithm com- 
prises the following 7 steps: 

(1) RMSD calculation: Compute the RMSD value, r, 
for the backbone atoms N, C a , and C of the current set 
of residues. 



(2) Removal of isolated residues: If present, exclude 
from the set of residues those with two neighbours that 
do not belong to the current set of residues, and start a 
new iteration at step 1. 

(3) Find residues with largest displacements: Among 
the selected residues find the one with the largest aver- 
age displacement whose removal does not open a new 
gap in the selected residues. Similarly, find the residue 
with the largest average displacement whose removal 
opens a new gap in the selected residues. The average 
displacement corresponds to the distance between an 
atom in a given conformer and its mean position after 
optimal superposition of all conformers onto the first 
conformer in the RMSD calculation of step 1, averaged 
over all conformers and over the backbone atoms N, 
C", and C of a residue. 

(4) Compute gap-weighted RMSD decrease: For the two 
residues found in step 3, compute the gap-weighted 
decrease of the RMSD value upon removing the residue 
from the set, Ar n ° 8ap = r - r" 08 ^ and Ar 8 ^ = y (r - r gap ), 
where r is the RMSD value from step 1, r™ 0 ^ and r 8 ^ are 
the RMSD of the selected residues after removing one of 
the residues from step 3, and 7 is a dimensionless para- 
meter that penalizes the opening of new gaps (if 7 < 1). 

(5) Residue removal: If the residue with the larger Ar 
value fulfils the two conditions Ar > 5 abs nIN and Ar/r 
> 8 rel nIN, remove it from the set of selected residues 
and start a new iteration at step 1. Here n and N denote 
the numbers of atoms included in the RMSD calculation 
of step 1 from the current residue and from all selected 
residues, respectively, and 8 abs and 8 rel are parameters 
for the minimally required absolute and relative RMSD 
decrease, respectively. 

(6) Retry residue removal: If no residue was removed 
in step 5, find among all selected residues the residue 
whose removal yields the largest gap-weighted decrease 
of the RMSD value. If the conditions of step 5 are ful- 
filled for this residue, remove it from the set of selected 
residues and start a new iteration at step 1. 

(7) Fill small gaps: If the set of selected residues con- 
tains small gaps of less than g (by default three) resi- 
dues, "fill" these gaps by additionally selecting the 
residues in the gap. 

Average displacements are calculated in step 3 to limit 
the number of RMSD calculations to two in step 4 and 
executing only rarely step 6 which requires an RMSD 
calculation for every selected residue. 

In this paper, unless noted otherwise, we used fA. = 8, 
m = 3, 7 = 0.4, 8 abs = 1.6 A, and 8 rel = 8 + 3.0/M with 
8 = 1.2, where M denotes the current number of 
selected residues, and g = 3. Smaller values of 7 lead to 
fewer gaps. The choice of 8 rel was motivated by the 
observation of the relative RMSD decrease values for 
randomly disordered structures. The increase of 8 rel for 
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small numbers of selected residues ensures the termina- 
tion of the algorithm. 
Output 

The CYRANGE method yields for each of the domains 
it identified the residue range(s) for superposition, the 
number of residues therein, and the average RMSD 
value to the mean coordinates for the backbone atoms 
in the residue range(s) for superposition. If the input 
consists of only two structures, the RMSD to the mean 
coordinates is equal to half the RMSD between the two 
structures. The CYRANGE algorithm is freely available 
as a web service and a stand-alone program at http:// 
www.bpc.uni-frankfurt.de/cyrange.html. It has also been 
implemented in the software package CYANA [17,18]. 

Application of CYRANGE to 37 proteins 

The residue ranges determined by CYRANGE for 37 pro- 
teins are visualized in the 3D structure bundles of Figure 2 
and summarized in Figures 3 and 4. Details regarding the 
proteins used are given in the Methods section. The 
method provided results that make sense by visual inspec- 
tion in that ordered regions were correctly identified, the 
global structure superpositions based on the CYRANGE 
residue ranges allow a clear presentation of the structure, 
and unnecessary small gaps within the selected ranges are 
absent. CYRANGE yielded similar results for the corre- 
sponding low- and high-precision structure bundles of the 
11 proteins of Figures 2a and 3, except that in enth and 
fsh2 additional loops, which are disordered only in the 
low-precision structure, were excluded, and that in the 
low-precision structure of smbp the second domain was 
not identified. For copz, a loop was excluded from the 
residue range for the high-precision structure that had 
been included for the low-precision structure because the 
residues are more uniformly disordered in the low-preci- 
sion structure than in the high-precision one. The latter 
exhibited a higher standard deviation of the RMSD per- 
residue than the low-precision structure. This shows that 
the CYRANGE method gives meaningful results also in 
the challenging case of low-quality structures where the 
distinction between well-defined and ill-defined regions 
becomes blurred. For the protein 2kr6, the CYRANGE 
method correctly identified the two structural domains 
despite the small size of the isolated helix (Figures 2b and 
4). The two protein-protein complexes, 2ktf and 2114, did 
not pose any particular problems. 

The CYRANGE results shown in Figures 2, 3, 4 have 
been obtained using the same parameter set for all pro- 
teins. We analyzed the influence of different values of the 
parameters on the resulting sequence coverage and RMSD 
for nine different protein structure bundles (Additional 
Files 3, 4, 5, 6, 7, 8). Each of the parameters \i, m, 5, 5 abs , 
y, and g was varied while keeping the other parameters at 
their default values. Varying the minimal cluster size fi in 




Figure 2 Structure superpositions Structure bundles 
superimposed for minimal backbone RMSD of the CYRANGE- 
determined residue ranges, indicated in red. Other residues are in 
black. Separate superpositions are shown for each domain, (a) 1 1 
proteins for which low-precision (left) and high-precision (right) 
NMR structures are available (see Methods), (b) 26 proteins, 23 of 
which had been used earlier for evaluating the FindCore algorithm. 



the range of 6-10 residues (Additional File 3), the domain 
boundary extension m in the range of 1-5 residues (Addi- 
tional File 4), or the minimal gap width g in the in the 
range of 1-5 residues (Additional File 5) did not change 
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Figure 3 Residue ranges for high- and low-precision NMR 
structures. Comparison of residue ranges found by the CYRANGE, 
PSVS, and FindCore methods for 1 1 proteins for which low-precision 
(cycle 1) and high-precision (final) NMR structures are available (see 
Methods). For each protein, the top line represents the complete 
amino acid sequence with the first and last residue labelled. Below 
are the residue ranges obtained from CYRANGE, PSVS, and 
FindCore. The corresponding RMSD values are indicated in A. 
Multiple domains are distinguished by thin and thick lines. 



the CYRANGE residue ranges for these 9 proteins, apart 
from the cases of lzda and the low-precision structure of 
copz, where raising m led to minute changes in sequence 
coverage. Variation of the parameter S for the relative 
RMSD decrease in step 5 of the residue range determina- 
tion in the range 0.5-4 led to changes only for the already 
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Figure 4 Residue range comparison Comparison of residue 
ranges found by the CYRANGE, PSVS, and FindCore methods for 26 
proteins. See Figures 2b and 3 for details. 



mentioned low-precision smbp structure (Additional File 
6). Higher values of the corresponding absolute RMSD 
decrease parameter 6 abs led to slight increases of the 
sequence coverage and the RMSD for some of the proteins 
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(Additional File 7). Similarly, a decrease of the sequence 
coverage and the RMSD were observed when increasing 
the gap penalty parameter y from 0 (no gaps allowed) to 4 
(gaps favoured) (Additional File 8). Thus the results from 
CYRANGE do not critically depend on the choice of these 
parameters but can, if desired, be guided towards a smaller 
or larger number of gaps or selected residues. The default 
values of the parameters appear to be appropriate for 
almost all proteins. Only the gap penalty parameter y may 
occasionally be adapted according to the emphasis put on 
simple residue ranges with no or few gaps. Meaningful 
values of y lie between 0, which yields a residue range 
without gaps, and 1, which selects residues without con- 
cern for the number of gaps. 

We also compared the residue ranges obtained by 
computing the angular order parameters at the outset of 
the algorithm for all rotatable dihedral angles or only 
for <j>, v|/, and y^, using default parameter settings. For all 
but 4 out of the 37 proteins in the test set the results 
were identical. In the 4 other cases the sequence cover- 
age, and usually the RMSD value, were higher when 
order parameters were calculated for all dihedral angles. 
The difference in coverage ranged from about 1 to 27%, 
and the difference in RMSD amounted to between 0 
and about 0.7 A. The 4 structure bundles for which dif- 
ferences became apparent were 2kr6, 2spz, the low-pre- 
cision structure of fspo, and the high-precision structure 
of smbp. In the case of 2kr6 the small second domain 
was not identified when only 4>, y, and % order para- 
meters were used. For smbp the differences amounted 
to only 1 or 2 residues per domain. For 2spz the 
CYRANGE version using only (f>, \\f, and % order para- 
meters unnecessarily identified 3 small domains instead 
of the single one reported by standard CYRANGE. For 
fspo a disordered loop was excluded if only 4>, and x 
order parameters were employed. Exclusion of this loop 
does seem sensible; raising the value of y in the standard 
CYRANGE will also bring about this result. Overall, 
both choices of dihedral angle order parameters yielded 
similar results. Where differences occurred, the results 
obtained with order parameters for all dihedral angles 
mostly corresponded better to the conclusions from 
visual inspection of the structure bundles. 

The computation time requirements of the CYRANGE 
algorithm are insignificant. It took CYRANGE 0.9 s to 
calculate the residue ranges for pbpa (142 residues, one 
domain), 1.0 s for scam (148 residues, two domains), 
and 5.8 s for smbp (370 residues, two domains) on a 
2.66 GHz Intel Core 2 processor. 

Comparison with PSVS and FindCore 

We compared the residue ranges from CYRANGE with 
those determined by the default algorithm of the Protein 
Structure Validation Suite PSVS, and by the FindCore 



algorithm. The algorithm in PSVS [19] is widely used 
for NMR protein structure validation, and was chosen 
here as a representative of the straightforward determi- 
nation of (locally) ordered residues. PSVS does therefore 
not attempt to identify structural domains. FindCore [4], 
determines residue ranges for global superposition and 
is able to identify multiple domains. 

The residue ranges obtained with the three algorithms 
for 37 different proteins are shown in Figures 3 and 4, 
and the differences of the RMSD and sequence coverage 
relative to the CYRANGE results are shown in Figure 5. 
In the majority of cases, the residue ranges from 
CYRANGE contain fewer gaps and cover significantly 
larger parts of the sequence than those from PSVS and 
FindCore. Consequently, the RMSD values for the resi- 
due ranges identified by CYRANGE are often slightly 
higher than those from PSVS and FindCore. This, how- 
ever, does not constitute a general rule. For instance, for 
the two-domain proteins lefc and ldld all CYRANGE 
domains simultaneously comprised more residues and 
showed lower RMSD values than those obtained from 
the other two algorithms (Figure 4). On average, the 
residue ranges reported by CYRANGE covered 85% of 
the sequences of these proteins and led to a backbone 
RMSD value of 0.77 A, as compared to 67% coverage 
and 1.72 A RMSD with PSVS, and 58% coverage and 
0.73 A RMSD with FindCore. 

In the proteins with multiple domains both CYRANGE 
and FindCore found all domains except one. CYRANGE 
missed a domain in the low-precision smbp structure, 
FindCore in 2kr6. Figure 5 compares the RMSD and 
sequence coverage of the PSVS and FindCore results 
with those obtained by CYRANGE. Each data point 

r \ 

12 T 
10-- 



so • ".m • -Jo 



Figure 5 RMSD and sequence coverage comparison. RMSD and 
sequence coverage differences between the residue ranges found 
by PSVS (red) or FindCore (blue) and CYRANGE for the proteins 
shown in Figures 3 and 4. The coverage is the percentage of amino 
acid residues included in the residue ranges found by the different 
methods. 
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represents one protein domain. Data points above/below 
the horizontal axis indicate cases where PSVS or Find- 
Core yielded larger/smaller RMSDs than CYRANGE. 
Data points to the right/left of the vertical axis indicate 
cases where PSVS or FindCore yielded larger/smaller 
sequence coverage than CYRANGE. Most data points are 
found near the horizontal axis in the lower left quadrant. 
For these proteins CYRANGE covered typically between 
10 and 50% more of the sequence with a small concomi- 
tant increase of the RMSD. Only a single data point, cor- 
responding to the low-precision smbp structure, is 
located in the lower right quadrant, indicating a signifi- 
cantly smaller sequence coverage and higher RMSD by 
CYRANGE (because the algorithm failed to identify the 
separate domains of this two-domain protein). In all 
other cases of higher sequence coverage by PSVS or 
FindCore the greater number of selected residues 
resulted in larger, often much larger RMSDs. There are 
also some cases in which, especially with FindCore, 
simultaneously a smaller sequence coverage and a larger 
RMSD were found than with CYRANGE. 

We have also correlated the sequence coverage and the 
RMSD values obtained by the three methods with the 
GDT total score, GDTTS, that reports the average per- 
centage of residues that can be superimposed under dis- 
tance cutoffs of 1, 2, 4, and 8 A [14]. Whereas there is a 
correlation between the sequence coverage by 
CYRANGE and the GDT TS value, no such correlation 
is apparent for FindCore or PSVS (Additional File 9). 
The sequence coverage by FindCore is around 60% for all 
proteins except for four cases with nearly 100% sequence 
coverage, and independent from the GDT TS value. The 
RMSD values do not correlate strongly with the GDT_TS 
values for any of the three methods (Additional File 10). 
This is not surprising because the GDT TS measures the 
fraction of residues that can be superimposed reasonably 
well, whereas the RMSD reports how well a given subset 
of residues, which may comprise a smaller or larger part 
of the entire protein sequence, can be superimposed. 

More detailed investigations into the differences 
between FindCore, PSVS, and CYRANGE results were 
performed, as the results, especially the number of gaps 
reported by the programs, vary considerably in some 
cases. Two examples of CYRANGE reporting few or no 
gaps where FindCore and PSVS report a high number of 
gaps are the results obtained for the low-precision struc- 
ture of pbpa, and for lbf8. In the first case CYRANGE 
excluded one disordered chain terminus, but it did not 
exclude a loop which, by visual inspection, seems slightly 
less ordered than the rest of the domain. PSVS kept the 
helices, yet excluded from them small, not considerably 
disordered-looking segments. It also excluded the chain 
termini, one of which does not seem highly disordered. 
The FindCore result was similar. Here we additionally 



found that some very small portions of otherwise excluded 
stretches were reported to be part of the domain. From 
lbf8 CYRANGE excluded one highly disordered loop. 
PSVS also excluded this loop, alongside parts which, by 
visual inspection, did not appear to be considerably disor- 
dered. Again, the FindCore result was similar to the one 
attained by PSVS, but the rather ordered-looking sections 
excluded by FindCore were often somewhat larger. 

Application to all NMR structures in the PDB 

On July 30, 2010 the PDB contained 6373 entries with 
protein structures determined by NMR that comprised at 
least 15 residues and for which a bundle of at least 5 con- 
formers was available. We applied CYRANGE to all of 
these structure bundles, and obtained results for all but 22 
files. In 4 cases no core atoms could be identified because 
the PDB files contained C a positions only, in 1 case the 
residue range determination failed, and in 17 cases no 
domains were found. 14 of those proteins comprised less 
than 20 residues and often consisted of a single helix with 
one or two disordered tails. Of the remaining three struc- 
tures two were highly disordered, and one, made up of 
25 residues, again contained a single helix with disordered 
tails only. The results for the remaining 6351 files are 
summarized in Figure 6. A complete list of the residue 
ranges and RMSD values is available at http://www.bpc. 
uni-frankfurt.de/cyrange_pdb.html. On average, the resi- 
due ranges covered 80% of the residues of a protein, and 
there were 1.07 domains and 0.59 intra-domain gaps per 
protein. 

For 95% of the proteins the CYRANGE residue ranges 
comprised more than 50% of all residues (Figure 6a). In 
only four cases (PDB IDs lr5s, 2fft, 2kes, 2v93) the resi- 
due ranges included less than 10% of all residues. Visual 
inspection of these structure bundles revealed that the 
low percentage of selected residues is correct for the two 
structures lr5s and 2fft, which consist of a single helix 
with long, disordered tails, whereas a larger residue range 
should have been selected for 2kes, which consists of a 
single a helix. The PDB entry 2v93 could not be handled 
properly because it combines multiple conformations 
within the individual conformers. 

Figure 6b shows that in 94% of structures CYRANGE 
identified a single domain. Two domains were found in 
5% of the structures, and in 69 cases (1%) CYRANGE 
found three or more (maximally 5) domains. In two 
cases (lzll and 2hyn) the structure was a pentamer. In 
the third case (2k27) visual inspection showed that the 
protein consists of two globular domains, which were 
correctly identified by CYRANGE with RMSDs of about 
1.1 A, and extended stretches at the chain termini and 
in the connection between the two globular domains, 
where CYRANGE identified three small "domains" of 
14-17 residues with RMSDs of 1.23-1.96 A. 
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Figure 6 Application to all NMR structures in the PDB Statistics 
of residue ranges determined with the CYRANGE algorithm for 6351 
NMR protein structure bundles in the Protein Data Bank as of July 
2010. (a) Percentage of residues in the residue range(s). (b) Number 
of domains, (c) Number of intra-domain gaps between the residue 
ranges, (d) RMSD values for the residue range. 



For 60% of the structures CYRANGE determined a 
residue range without intra-domain gaps (Figure 6c). 
Four or more intra-domain gaps were identified in only 
about 1% of the structures, i.e. the CYRANGE algorithm 
selected whenever possible simple residue ranges with 
no or only very few gaps. 

The distribution of the backbone RMSD values for the 
domains identified by CYRANGE (Figure 6d) indicates 
that 62% of all domains have an RMSD value below 



0.5 A. Less than 1% exhibit RMSD values above 2 A, 
and only five domains reported by CYRANGE are 
severely disordered with RMSD values in the range of 
4.0-5.5 A, which appear to be largely ill-defined also by 
visual inspection. This shows that in almost all cases 
CYRANGE determined residue ranges that can be 
superimposed well. Considering the large number of 
domains, it cannot be excluded that in some cases more 
appropriate residue ranges could be identified by visual 
inspection or other methods. Nevertheless, the facts that 
the algorithm failed to provide a result for only 0.34% of 
the PDB entries and that for more than 99% of the 
domains CYRANGE yielded residue ranges with RMSDs 
below 2 A and covering a significant fraction of the 
sequence indicate the usefulness of CYRANGE as a gen- 
eral tool for objective residue range determination. 

Conclusions 

We introduced the CYRANGE algorithm for the deter- 
mination of residue ranges for the superposition of pro- 
tein structures and showed that it provides meaningful 
results for a large number of NMR protein structure 
bundles. The algorithm needs no adaptations of para- 
meters for individual protein structures. CYRANGE is a 
tool for protein structure analysis that is available for 
easy integration in validation packages [19-23]. Although 
algorithms for the same purpose have been developed 
earlier [3-7,14], none of them has become a widely-used 
standard for experimentally determined protein struc- 
tures, where, despite a clear need for standardization 
and automation, the manual selection of residue ranges 
or the use of suboptimal criteria remain commonplace. 
We hope CYRANGE to fill this gap. 

Methods 

Use of FindCore and PSVS 

For comparison, also the Protein Structure Validation 
Software Suite (PSVS) [19] and the program FindCore 
[4] were employed to identify residue ranges and 
domains in our protein test set. The programs were 
used through the web portals http://psvs-l_4-dev.nesg. 
org (PSVS) and http://fps.nesg.org (FindCore). 

With PSVS the default option 'ordered residues' was 
selected for the residue selection for analysis. The residues 
reported as "ordered" were taken as the residue ranges 
identified by PSVS. Note that with these options PSVS 
does not identify multiple domains. 

"With FindCore, the 'average structure' was selected as 
the reference structure, the analysis was based on 'standard 
amino acids', and only backbone atoms were used in 
domain identification. The calculations were also per- 
formed using all instead of only backbone atoms, with lar- 
gely equivalent results (data not shown). FindCore reports 
the number of domains it identified, yet it does not 
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unambiguously state the boundaries of the domains. 
Instead, the program provides a list of 'core residues'. 
When the program reported more than one domain, we 
manually attributed the core residues to the individual 
domains. 

Analyses of the results 

PSVS and FindCore results were downloaded from the 
internet, and the output residue ranges were extracted 
from the source code of the downloaded web pages in 
an automated fashion. All reported RMSD values were 
calculated with CYANA for the backbone atoms N, C a , 
and C in the reported residue ranges, and with respect 
to the mean coordinates. With FindCore the RMSD 
value of each identified domain was calculated sepa- 
rately; for PSVS all reported 'ordered residues' were 
used in the RMSD calculations, as no domain informa- 
tion is provided by PSVS. The program MOLMOL [12] 
was used to visualize structures. 

Determination of the average GDTJTS value of each 
structure bundle 

The web server on http://proteinmodel.org/AS2TS/ 
LGA/lga.html (parameter set -3 -oO -d:4.0) was used for 
obtaining the GDTTS values whose averages are shown 
in the Additional Files 9 and 10. From each structure 
bundle all possible conformer pairs consisting of the 
first conformer and each subsequent conformer were 
subjected to the calculation. The results were extracted 
from the web site in an automated fashion, and for each 
structure bundle the average of the individual GDT TS 
values was computed. 

NMR protein structures 

The performance of CYRANGE was assessed on the basis 
of the NMR structure bundles of the 11 proteins, for 
which the NMR solution structure had been determined 
earlier. We refer to these proteins by four-letter codes: 
copz [24], PDB 1CPZ; cprp [25], PDB 1U3M, enth [26,27], 
PDB 1VDY; fsh2 [28,29], PDB 1WQU; fspo [30], PDB 
1VEX; pbpa [31], PDB 1GM0; rhod [32,33], PDB 1VEE; 
scam [34], PDB 1X02; smbp [34], PDB 2D21; wmkt [35], 
PDB 1WKT; ww2d [36], PDB 2DWV. 

The proteins copz, cprp, enth, fsh2, pbpa, rhod and 
wmkt are proteins with a well-defined single-domain 
structure. The protein fspo has an unusual, less well- 
defined fold without regular secondary structure. The 
proteins scam and smbp are proteins with two domains 
connected by a flexible linker. The protein ww2d forms 
a symmetric dimer. 

Two structure bundles were considered for each of 
these proteins: the final structure bundle, and the struc- 
ture bundle obtained in the initial cycle 1 of automated 
NOE assignment and structure calculation [37] with 



CYANA [17], i.e. all structures were recalculated using 
the experimental chemical shift lists, NOESY peak lists, 
and possible additional torsion angle or hydrogen bond 
restraints. This enabled comparisons of the CYRANGE 
ranges for two structure bundles of different precision 
and quality for each of the proteins. 

CYRANGE was also applied to a set of 26 NMR pro- 
tein structures, 23 of which had been used earlier for 
evaluating the FindCore algorithm [4]. In addition, the 
protein 2kr6 was included as an example of a protein 
with a large domain and a flexibly connected small helix 
that constitutes a separate domain [38] and 2ktf and 
2114 were included as examples of protein-protein com- 
plexes. We refer to these 26 proteins by their PDB 
codes. 

Application to all NMR structures in the PDB 

The entire set of NMR structures from the PDB [39] as 
available on July 30, 2010 was subjected to domain and 
residue range determination with the CYRANGE method, 
provided that the files contained at least five conformers 
and 15 amino acid residues. 

Additional material 



Additional file 1: Dependence of Q on the order parameter rank 

The quantity Q, is plotted against the order parameter rank / for 9 
different protein structure bundles. 

Additional file 2: Dependence of P on the clustering stage The 

quantity P f is plotted against the clustering stage / for 9 different protein 
structure bundles. 

Additional file 3: Dependence of CYRANGE results on the minimal 
cluster size parameter \i The sequence coverage (red) and RMSD 
(blue) of the residue ranges determined by CYRANGE were plotted as a 
function of fj for 9 different protein structure bundles. The dotted vertical 
line indicates the default value, /j = 8. Where CYRANGE found two 
domains, the RMSD values of the individual domains are shown in light 
and dark blue. 

Additional file 4: Dependence of CYRANGE results on the domain 
boundary extension parameter m. See Additional File 3 for details. 

Additional file 5: Dependence of CYRANGE results on the minimal 
gap width g. See Additional File 3 for details. 

Additional file 6: Dependence of CYRANGE results on the relative 
RMSD decrease parameter <5. See Additional File 3 for details. 

Additional file 7: Dependence of CYRANGE results on the absolute 
RMSD decrease parameter S abs . See Additional File 3 for details. 

Additional file 8: Dependence of CYRANGE results on the gap 
penalty parameter y See Additional File 3 for details. 

Additional file 9: Correlation between the sequence coverage from 
CYRANGE, FindCore and PSVS, and the GDT total score, GDT_TS 

Each data point represents a protein shown in Figures 3 and 4. The 
coverage is the percentage of amino acid residues included in the 
residue ranges found by the different methods. The GDT_TS value is 
defined by GDT_TS = (P, + P 2 + P 4 + P 8 )/4, where P d is the fraction of 
residues that can be superimposed under a distance cutoff of d A. 

Additional file 10: Correlation between the RMSD value for the 
residue ranges from CYRANGE, FindCore and PSVS, and the GDT 
total score, GDT_TS. Each data point represents one protein domain. 
See Additional File 9 for details. 
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